MSDS696 Data Science Practicum II

Topic: Rain Prediction in Australia using Machine Learning with Python

Date: 21AUGUST2022

Author: Olumide Aluko

Purpose: This project uses the Rain in Australia dataset from Kaggle. The problem is to predict whether or not it will rain tomorrow, given today's weather conditions. We will be using a Decision Tree Classifier.

Table of contents:

  1. Problem Statement
  2. Data Description
  3. Import Libraries
  4. Configurations
  5. Import Dataset
  6. Exploratory Data Analysis and Visualization
  7. Train, Validation, Test Split
  8. Identify Inputs & Target Columns
  9. Identify Numerical & Categorical Columns
  10. Impute Missing Values
  11. Scaling Numerical Columns
  12. Encoding Categorical Columns
  13. Training & Visualizing Decision Trees
  14. Feature Importance
  15. Hyperparameter Tuning - To Reduce Overfitting
  16. Random Forest Algorithm Training
  17. Conclusion

1. Problem Statement:

Predict next-day rain by training classification models on the target variable using the Australia Rainfall data.

Solution: Design a predictive classification model (Decision Tree) using machine learning algorithms to forecast whether or not it will rain tomorrow in Australia.

2. Data Description:

Dataset Source: https://www.kaggle.com/code/ankitjoshi97/rainfall-in-australia-eda-prediction-89-acc/data

The dataset is taken from Kaggle and contains about 10 years of daily weather observations from many locations across Australia.

Dataset Description:

3. Import Libraries

Let's import the necessary libraries.

4. Configurations

Let's set some configurations needed for matplotlib, seaborn and pandas.

5. Import Dataset

Let's download the dataset and import it using the pandas function read_csv().

Let's look at the info of the dataset,

There are 145460 samples, of which 142193 have a non-null 'RainTomorrow' value. Therefore, we can simply remove the rows in which the 'RainTomorrow' column is null, since there will be no significant information loss.
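Dropping the rows with a missing target removes 3,267 of the 145,460 rows, about 2.2% of the data. A minimal sketch of that step is shown below; the small DataFrame is a made-up stand-in for the real file, and only the column names follow the Kaggle dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration).
df_rain = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "Rainfall": [0.6, 0.0, np.nan, 1.0],
    "RainTomorrow": ["No", np.nan, "Yes", "No"],
})

rows_before = len(df_rain)

# Keep only rows where the target is known; other missing values are
# left in place and handled later during imputation.
df_rain = df_rain.dropna(subset=["RainTomorrow"])

rows_dropped = rows_before - len(df_rain)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```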

Checking the Dimensions of Dataset: The shape property is utilized to detect the dimensions of the dataset.

print(df_rain.shape)

Summary of a Dataset:

Let’s generate descriptive statistics for the dataset using the function describe() in pandas.

Descriptive Statistics: These are used to summarize and describe the features of the data in a meaningful way to extract insights. Two types of statistics are used to describe or summarize data: measures of central tendency and measures of spread.

From the above descriptive statistics, we can deduce the following:

The statistics displayed for attributes of the 'object' datatype are different from those displayed for numeric datatypes. Some of the conclusions drawn from the above table are:

Observations:-

6. Exploratory Data Analysis and Visualization

Correlations

Let’s see if we can pull out some correlations between locations based on temperature and rainfall. We do get ~(4-5) clusters of locations with similar rainfall patterns: Sydney region (Sydney, Penrith, Richmond, etc.), Perth, Central Australia and Southern Australia (Melbourne, Tasmania).

Observation:-

Let us try a heatmap of the correlations between all features in the dataset.

From the above correlation heatmap, we can see that a few features move together with others and can be termed positively correlated.

Observation:-

sns.boxplot(x = "MinTemp", y = "RainTomorrow", data = df_rain, dodge = True);

From the above graph, RainTomorrow = "Yes" has the highest humidity of 100 when Temp3pm is 20.7 degrees Celsius.

From the above graph, RainTomorrow = "No" has the highest count of 1278 when Temp3pm is between 19.2 and 19.3 degrees Celsius.

Next, I plotted a count chart of whether it rained the next day.

The graph shows that days followed by no rain are several times more numerous than days followed by rain. Hence, there is a class imbalance that we have to deal with. To counter it, we will oversample the minority class. Since the dataset is quite small, subsampling the majority class wouldn't make much sense here.
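One way to oversample the minority class is to resample its rows with replacement until both classes have the same count. The sketch below uses pandas only and a made-up frame called train_df; the variable names and values are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

# Toy training frame with an 8:2 imbalance (values are invented).
train_df = pd.DataFrame({
    "Humidity3pm": [22, 25, 30, 16, 33, 40, 28, 35, 80, 91],
    "RainTomorrow": ["No"] * 8 + ["Yes"] * 2,
})

majority = train_df[train_df["RainTomorrow"] == "No"]
minority = train_df[train_df["RainTomorrow"] == "Yes"]

# Draw minority rows with replacement until they match the majority count.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle so the classes are interleaved.
balanced_df = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

class_counts = balanced_df["RainTomorrow"].value_counts()
print(class_counts)
```

Oversampling is normally applied to the training split only, so that the validation and test sets keep the real class proportions.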

7. Train, Validation, Test Split

The dataset is time series data: a collection of observations obtained through repeated measurements over time. If the points are plotted on a graph, one of the axes is always time.

The given data is in chronological order. When working with chronological data, it is often a good idea to separate the training, validation and test sets by time, so that the model is trained on data from the past and evaluated on data from the future.

Let's use the data up to 2014 for the training set, the data from 2015 for the validation set, and the data from 2016 and 2017 for the test set.

To achieve this,
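A minimal sketch of a year-based split is shown below. It assumes the frame has a Date column that pandas can parse, as the Kaggle file does; the rows here are made up.

```python
import pandas as pd

# Toy frame with one row per year of interest (values are invented).
df_rain = pd.DataFrame({
    "Date": ["2013-05-01", "2014-11-20", "2015-03-15", "2016-07-04", "2017-01-09"],
    "RainTomorrow": ["No", "Yes", "No", "No", "Yes"],
})

# Extract the calendar year of each observation.
year = pd.to_datetime(df_rain["Date"]).dt.year

train_df = df_rain[year < 2015]   # everything up to and including 2014
val_df = df_rain[year == 2015]    # 2015 only
test_df = df_rain[year > 2015]    # 2016 and 2017

print(len(train_df), len(val_df), len(test_df))
```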

8. Identify Inputs & Target Columns

The columns other than RainTomorrow are the independent (input) columns, while the RainTomorrow column is the dependent (output) column.

Identify inputs and outputs

X_train holds the training data's inputs and Y_train holds the training data's output. The validation and test data are split in the same way.

9. Identify Numerical & Categorical Columns

From the information of the dataset shown above, the Dtype column specifies the datatype of the column values. Separate preprocessing steps are to be carried out for categorical data and numerical data. Hence we'll identify the columns which are numerical and which are categorical for preprocessing purposes.
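The two identification steps above can be sketched as follows. The frame is a made-up example with a handful of the dataset's columns, and the names input_cols, target_col, numeric_cols and categorical_cols are illustrative choices.

```python
import numpy as np
import pandas as pd

# Toy training frame (values are invented; column names follow the dataset).
train_df = pd.DataFrame({
    "Location": ["Sydney", "Perth", "Albury"],
    "MinTemp": [13.4, 7.4, 12.9],
    "Rainfall": [0.6, 0.0, 1.2],
    "RainToday": ["No", "No", "Yes"],
    "RainTomorrow": ["No", "Yes", "No"],
})

# Every column except the target is treated as an input.
target_col = "RainTomorrow"
input_cols = [col for col in train_df.columns if col != target_col]

X_train = train_df[input_cols].copy()
Y_train = train_df[target_col].copy()

# Split the inputs by datatype: numeric columns versus everything else.
numeric_cols = X_train.select_dtypes(include=np.number).columns.tolist()
categorical_cols = X_train.select_dtypes(exclude=np.number).columns.tolist()

print(numeric_cols)
print(categorical_cols)
```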

Remove rows for which target column is empty

10. Impute Missing Values

As discussed already, preprocessing steps are carried out separately for numerical and categorical columns. First, let's impute the missing values in the numerical columns with the mean of the corresponding column.

The code below displays the counts of null values in the numerical columns, sorted in descending order.

The code below imputes each numerical column with its mean.

Now, after imputing the null values with the mean, the counts of null values are:
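The imputation steps described in this section might look like the sketch below, which uses scikit-learn's SimpleImputer. Fitting the imputer on the training inputs only and reusing it for the validation inputs is an assumption of this sketch, and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

numeric_cols = ["MinTemp", "Rainfall"]

# Toy inputs with some missing values (invented for illustration).
X_train = pd.DataFrame({"MinTemp": [10.0, np.nan, 14.0], "Rainfall": [0.0, 2.0, np.nan]})
X_val = pd.DataFrame({"MinTemp": [np.nan, 8.0], "Rainfall": [1.0, np.nan]})

# Count the nulls per numerical column, largest first.
print(X_train[numeric_cols].isna().sum().sort_values(ascending=False))

# Learn the column means from the training inputs, then fill every split with them.
imputer = SimpleImputer(strategy="mean").fit(X_train[numeric_cols])
X_train[numeric_cols] = imputer.transform(X_train[numeric_cols])
X_val[numeric_cols] = imputer.transform(X_val[numeric_cols])

nulls_after = int(X_train[numeric_cols].isna().sum().sum())
print(nulls_after)
```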

11. Scaling Numerical Columns

Let's review the importance of scaling before proceeding. Feature scaling is a method of bringing the independent attributes in the data into a fixed range. It is done during data pre-processing to handle values with highly varying magnitudes or units. If feature scaling is not done, many machine learning algorithms tend to give greater weight to features with larger values and less weight to features with smaller values, regardless of the units of the values.
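As an illustration of scaling to a fixed range, the sketch below uses scikit-learn's MinMaxScaler, which maps each column to the range 0 to 1. The choice of MinMaxScaler is an assumption here, since the text does not name the scaler, and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

numeric_cols = ["MinTemp", "Pressure9am"]

# Two columns with very different magnitudes (values are invented).
X_train = pd.DataFrame({
    "MinTemp": [0.0, 10.0, 20.0],
    "Pressure9am": [1000.0, 1010.0, 1020.0],
})

# Learn each column's minimum and maximum, then rescale to [0, 1].
scaler = MinMaxScaler().fit(X_train[numeric_cols])
X_train[numeric_cols] = scaler.transform(X_train[numeric_cols])

scaled_values = X_train[numeric_cols].to_numpy()
print(np.round(scaled_values, 3))
```

After scaling, both columns span the same range, so neither dominates purely because of its units.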

12. Encoding Categorical Columns

Let's now look at what encoding is and why it is needed. Encoding categorical data is the process of converting categorical data into a numerical format so that it can be provided to the models for training and prediction.

Most machine learning models, including those in scikit-learn, learn only from numerical data, which is why categorical data must be converted to numbers during preprocessing.

The categorical columns in our dataset are,

Before encoding the categorical columns, one must be sure that there are no null values in those columns, because the nulls would also be encoded, which doesn't make sense. Hence, the null values in the categorical columns should be imputed before the columns are encoded. This is similar to imputing the numerical columns before scaling them.

The code below displays the count of null values in the categorical columns:

Numerical columns are imputed with the mean, but this is not possible for categorical columns. For categorical columns, either the mode can be used or some other dummy value can be substituted in place of the null values. Here, let's substitute 'Unknown' in place of null values.

This can be achieved as follows:

Now the counts of null values are:

After imputing the null values let's perform encoding.

Let's combine the preprocessed numerical and categorical columns for model training.

13. Training & Visualizing Decision Trees

A decision tree in machine learning works much like a flowchart of hierarchical yes/no decisions, except that we let the computer figure out the optimal structure and hierarchy of decisions instead of coming up with the criteria manually.

Since this is a classification task, let's use the DecisionTreeClassifier algorithm.

Training

We have trained our classifier with the training data.

Evaluation

To review the training process, let's check how well the model fits the training data.

The counts of the predicted results show that our model has predicted 'No' for the target column RainTomorrow more often than 'Yes'.

Now, let's calculate the accuracy of our model on the training data.

Interesting! The training set accuracy is close to 100%. But we can't depend completely on the training set accuracy; we must evaluate the model on the validation set too. This is because our model should generalize, i.e., it should be able to predict the output for inputs that are not present in the training data.
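The gap between training and validation accuracy can be reproduced on synthetic data. The sketch below uses make_classification as a stand-in for the weather data, so the numbers it prints are not the notebook's results; the label noise (flip_y) is there to give the tree something to memorize.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced, noisy binary classification data (not the real dataset).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.1, weights=[0.78, 0.22], random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Default parameters: the tree grows until every training sample is classified correctly.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```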

Let's also calculate the percentage of 'Yes' and 'No' in validation data.

The above result shows 78.8% 'No' and 21% 'Yes' in the validation data. This means that predicting 'No' for every validation row would already be 78.8% accurate. Hence, our model can only be said to have learned something useful if its validation accuracy exceeds 78.8%, the accuracy of a dumb model that always predicts 'No'.
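The majority-class baseline is simply the share of the most frequent class. The sketch below uses a made-up series of 1,000 labels whose proportions roughly mimic the reported split; the name val_targets is illustrative.

```python
import pandas as pd

# Made-up validation targets: 788 'No' and 212 'Yes'.
val_targets = pd.Series(["No"] * 788 + ["Yes"] * 212)

# Fraction of rows in each class.
class_share = val_targets.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = class_share.max()
print(class_share)
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```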

Summary

DecisionTreeClassifier with default parameters

The above is a case of overfitting: the unconstrained tree grew to its maximum depth and memorized the training values, so its accuracy on the validation and test datasets was only about 79.28%.

Visualization of Decision Tree

14. Feature Importance

The initial 23 columns became 119 features after encoding. A decision tree can estimate the importance of each feature by itself. Below are some of the importances of the 119 features (the total number of features in the training dataset).

Note: Only some feature importances are displayed but the above code displays for all features.

Let's view importances of top 10 features.
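A fitted DecisionTreeClassifier exposes these values through its feature_importances_ attribute. The sketch below ranks them on synthetic data; the feature names are placeholders, not the 119 real columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (not the real dataset).
X, y = make_classification(n_samples=500, n_features=12, n_informative=4, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False)

top10 = importance_df.head(10)
print(top10)
```

The importances are normalized to sum to 1, so each value can be read as a share of the tree's total impurity reduction.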

15. Hyperparameter Tuning - To Reduce Overfitting

Now that we found out our model is only marginally better than a dumb model because of overfitting, we should modify some of the parameters of DecisionTreeClassifier to reduce overfitting.

The DecisionTreeClassifier accepts several arguments, some of which can be modified to reduce overfitting.

Two such arguments are max_depth and max_leaf_nodes. Reducing the maximum depth of the tree can reduce overfitting. With the default setting (max_depth=None) the depth is unlimited, and the tree grew to a depth of 48; it is reduced to 3 below to reduce overfitting.

Hyperparameter tuning

Our model had 100% training accuracy, which means that the model is memorising the inputs. Comparing it with the validation and test accuracy of approximately 79.28%, we clearly see a case of overfitting. We need to change some of the model's training parameters to avoid overfitting. One possible way of doing this is to reduce the max depth of the tree. Let us train the model again.

Let us score the model on the training, validation and test datasets again.

As we can see, the training accuracy is now just 83%, which suggests the model is no longer memorising the training values. Let us try the same for the validation and test datasets.

We now have significantly better performance on the validation and test datasets. Let us get the confusion matrix.

Tuning max_depth

Without a manual constraint, the tree that overfitted grew to a max_depth of 48, and max_depth cannot be 0 or less. Hence, let's find the best value of max_depth by trial and error, and use the max_depth that gives the best balance of training and validation error.

From the dataframe above, it can be seen that the training accuracy increases as max_depth increases. The validation accuracy, however, first increases and then decreases.
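A trial-and-error search of this kind could be sketched as follows. The data is synthetic, the helper max_depth_accuracy is a hypothetical name, and the best depth it reports applies to the synthetic data only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data standing in for the weather dataset.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)


def max_depth_accuracy(max_depth):
    """Train a tree limited to max_depth and report its accuracies."""
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    return {
        "max_depth": max_depth,
        "train_accuracy": model.score(X_train, y_train),
        "val_accuracy": model.score(X_val, y_val),
    }


# Try depths 1 to 20 and collect one row of results per depth.
results_df = pd.DataFrame([max_depth_accuracy(depth) for depth in range(1, 21)])

# The depth with the highest validation accuracy is the candidate to keep.
best_row = results_df.loc[results_df["val_accuracy"].idxmax()]
print(results_df)
print(f"best max_depth on this data: {int(best_row['max_depth'])}")
```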

Tuning Graph

Let's visualise the training accuracy and validation accuracy with different max_depths.

From the graph it can also be seen that training accuracy increases with increase in max_depth while validation accuracy first increases (till max_depth = 7) and then decreases. Hence, optimal max_depth is 7.

Build Decision Tree with max_depth = 7

Tuning max_leaf_nodes

Another way to control the size or complexity of a decision tree is to limit the number of leaf nodes. This allows branches of the tree to have varying depths. Let's limit the number of leaf nodes to a maximum of 128.
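A minimal sketch of limiting the leaf count is shown below, again on synthetic data rather than the weather dataset. The get_n_leaves() and get_depth() methods report the size of the fitted tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data standing in for the weather dataset.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Cap the number of leaves; the tree keeps the splits that reduce impurity most.
model = DecisionTreeClassifier(max_leaf_nodes=128, random_state=42)
model.fit(X_train, y_train)

n_leaves = model.get_n_leaves()
depth = model.get_depth()
print(f"leaves: {n_leaves}, depth: {depth}")
print(f"train: {model.score(X_train, y_train):.3f}, val: {model.score(X_val, y_val):.3f}")
```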

Let's see the accuracies when max_leaf_nodes was set to 128 at maximum.

Now, let's train our DecisionTreeClassifier with max_leaf_nodes = 128 and max_depth = 6,

Let's now use the trial and error method considering the two parameters.

Tuning Graph

Let's visualise the training accuracy and validation accuracy with different max_depths and max_leaf_nodes = 128.

It seems max_depth = 9 and max_leaf_nodes = 128 are the optimal hyperparameters.

Now, let's train our classifier with the best found hyperparameters,

DecisionTreeClassifier with max_depth = 9

The above performance is considerably better for new predictions, as the accuracies on the training, validation and test data are almost the same.

Decision Tree Classification Confusion Matrix

16. Random Forest Algorithm Training

Random Forest is an ensemble technique in which:

  1. Multiple decision trees are trained, each on a different random sample of the training data and features.
  2. The outcomes of the individual trees are voted on or averaged.
  3. For a classifier, the class with the most votes becomes the final prediction.
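The ensemble described above is available in scikit-learn as RandomForestClassifier. The sketch below trains one on synthetic data; n_estimators=100 is the library default stated explicitly, and the printed accuracies are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic noisy data standing in for the weather dataset.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Each of the 100 trees sees a bootstrap sample of the rows and considers
# a random subset of the features at every split.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

forest_train_acc = forest.score(X_train, y_train)
forest_val_acc = forest.score(X_val, y_val)
print(f"train: {forest_train_acc:.3f}, validation: {forest_val_acc:.3f}")
```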

Let us now get the score of model for train, test and validation dataset.

Random Forest Classification Confusion Matrix

From the above confusion matrix, there are 19,009 true negatives, 3,159 false negatives, 1,019 false positives, and 2,787 true positives. This illustrates that the Random Forest model is the best of the models tried.
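The four counts quoted above are enough to work out accuracy, precision and recall by hand. The short calculation below uses only those counts and treats 'Yes' as the positive class.

```python
# Counts quoted from the confusion matrix above.
tn, fn, fp, tp = 19009, 3159, 1019, 2787

total = tn + fn + fp + tp

# Share of all predictions that were correct.
accuracy = (tn + tp) / total

# Of the days predicted as rainy, the share that really were rainy.
precision = tp / (tp + fp)

# Of the days that really were rainy, the share the model caught.
recall = tp / (tp + fn)

print(f"total: {total}")
print(f"accuracy: {accuracy:.4f}, precision: {precision:.4f}, recall: {recall:.4f}")
```

The recall is noticeably lower than the accuracy, which reflects the class imbalance: most of the correct predictions are 'No' days.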

The notebook also has an implementation of RandomForest with training accuracy 99.99% and a validation accuracy of 85.58%.

Finally, one can conclude that the Random Forest model is better in the sense that it yields higher accuracy than the other models.

17. Conclusion:

For the decision tree model with default parameters, the training accuracy is 99.99%, the validation accuracy is 79.28%, and the percentage of 'No' in the validation data is 78.8%. Hence, this model is only marginally better than always predicting 'No'. This occurs because the training data from which the model learned is skewed towards 'No' and the unconstrained decision tree overfits.

After hyperparameter tuning was applied to reduce overfitting, the Decision Tree achieved a training accuracy of 84.89% and a validation accuracy of 84.46%.

Scikit-learn's default hyperparameters are sensible in general, but they sometimes fail for specific use cases, and it is left to data scientists to tune them. Decision Tree and Random Forest models are always at risk of overfitting.

The notebook also has an implementation of Random Forest with a training accuracy of 99.99% and a validation accuracy of 85.58% (approximately 86%), which is the higher validation accuracy of the two models.

Finally, one can conclude that the Random Forest model is the better of the two in the sense that it yields higher validation accuracy than the Decision Tree.